Crawling Chinese-Myanmar Parallel Corpus: Automatic Collection, Screening and Cleaning Corpus
Similar resources
Automatic English - Chinese Parallel Corpus Acquisition and Sentences Extraction
The Internet holds many valuable resources that can supply cross-language, cross-domain parallel corpora. Some earlier methods were developed for this mining task, but they often use only one feature in the mining process. We use multiple reasonable features of parallel pages to acquire parallel corpora. Finally, we also add an SVM classifier which utilizes all the features to ...
Automatic Acquisition of Chinese-English Parallel Corpus from the Web
Abstract. Parallel corpora are a valuable resource for tasks such as cross-language information retrieval and data-driven natural language processing systems. Previously only small scale corpora have been available, thus restricting their practical use. This paper describes a system that overcomes this limitation by automatically collecting high quality parallel bilingual corpora from the web. ...
Focused Web Corpus Crawling
In web corpus construction, crawling is a necessary step, and it is probably the most costly of all, because it requires expensive bandwidth usage, and excess crawling increases storage requirements. Excess crawling results from the fact that the web contains a lot of redundant content (duplicates and near-duplicates), as well as other material not suitable or desirable for inclusion in web cor...
Automatic Reordering Rule Generation Based On Parallel Tagged Aligned Corpus for Myanmar-English Machine Translation
Reordering is an important problem to consider when translating between language pairs with different word orders. Myanmar is a verb-final language, so reordering is needed when it is translated into languages whose word order differs from that of Myanmar. In this paper, automatic reordering rule generation for Myanmar-English machine translation is presented. In order to generate...
3-Step Parallel Corpus Cleaning Using Monolingual Crowd Workers
A high-quality parallel corpus needs to be manually created to achieve good machine translation for domains which do not have enough existing resources. Although the quality of the corpus can be improved to some extent by asking professional translators to translate, it is impossible to avoid mistakes completely. In this paper, we propose a framework for cleaning the existing...
Journal
Journal title: IOP Conference Series: Materials Science and Engineering
Year: 2019
ISSN: 1757-899X
DOI: 10.1088/1757-899x/646/1/012046